[CI] Stabilize PostCommit Java ValidatesRunner Dataflow Streaming workflow#38753
[CI] Stabilize PostCommit Java ValidatesRunner Dataflow Streaming workflow#38753durgaprasadml wants to merge 5 commits into
Conversation
…kflow - throttle validatesRunner parallelism - migrate workflow to Streaming Engine - add metrics-driven streaming termination - delay transient JOB_MESSAGE_ERROR cancellations - add Gradle test retry support - add streaming runner tests Fixes apache#38710
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a series of stability improvements for the PostCommit Java ValidatesRunner Dataflow Streaming workflow. By addressing resource contention through concurrency throttling, migrating to the more efficient Streaming Engine, and implementing smarter, metrics-driven job termination, the changes aim to drastically reduce CI flakiness and improve the reliability of the testing infrastructure. Highlights
New Features🧠 You can now enable Memory (public preview) to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces the org.gradle.test-retry plugin to automatically retry failed tests on CI, throttles default test parallelism in the Dataflow runner to prevent quota exhaustion, and implements early success/failure cancellation for streaming jobs in TestDataflowRunner by polling metrics asynchronously. The review feedback highlights critical robustness improvements for the background monitoring thread to handle transient API exceptions without silently terminating, and suggests updating the new unit tests to explicitly verify asynchronous job cancellation using Mockito's timeout verification.
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
Summary
This PR stabilizes the PostCommit Java ValidatesRunner Dataflow Streaming workflow, which is currently failing more than 50% of the time due to infrastructure contention, legacy streaming runner limitations, and overly aggressive streaming job cancellation behavior.
Fixes #38710
Root Causes Identified
1. Unbounded Parallelism / Resource Exhaustion
The validatesRunner tasks were configured with:
groovy id="n9b5m2" maxParallelForks Integer.MAX_VALUE
Combined with GitHub Actions max-workers: 12, this could launch up to 12 concurrent Dataflow streaming jobs simultaneously.
This frequently exhausted:
leading to worker startup starvation and test timeouts.
2. Legacy Streaming Worker Non-Termination
The workflow previously used the legacy VM-based streaming execution path:
bash id="ewl1pu" :runners:google-cloud-dataflow-java:validatesRunnerStreaming
Bounded streaming pipelines under the legacy runner often failed to terminate automatically, remaining in RUNNING state until the 15-minute timeout cancelled them.
3. Aggressive Failure Cancellation
TestDataflowRunner immediately cancelled jobs upon encountering any JOB_MESSAGE_ERROR, even for transient worker/network issues that Dataflow could automatically recover from.
This caused false-negative failures in CI.
Changes Implemented
Throttle validatesRunner concurrency
Reduced validatesRunner concurrency to:
groovy id="pvb0u9" maxParallelForks = 4
with support for overriding via:
bash id="6a4s1z" -PmaxParallelForks=
This reduces quota pressure and runner overload.
Migrate workflow to Streaming Engine
Updated the workflow to run:
bash id="0kcg4q" :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine
Benefits:
Add metrics-driven early termination
Enhanced TestDataflowRunner to continuously poll:
during streaming execution.
Behavior:
This reduces successful test runtime from ~15 minutes to ~2–3 minutes.
Delay cancellation on transient worker errors
Added a recovery window before cancelling jobs due to transient JOB_MESSAGE_ERROR entries, allowing Dataflow retries and self-healing to stabilize the pipeline.
Add Gradle test retry support
Integrated the org.gradle.test-retry plugin for CI integration tests to reduce transient infrastructure-related failures.
Validation
Added/updated tests covering:
Verification command:
bash id="j7bw3w" ./gradlew :runners:google-cloud-dataflow-java:test \ --tests "org.apache.beam.runners.dataflow.TestDataflowRunnerTest"
Streaming validation command:
bash id="o3llzt" ./gradlew :runners:google-cloud-dataflow-java:validatesRunnerStreamingEngine \ -PtestFilter="org.apache.beam.sdk.transforms.GroupByKeyTest"
Expected Impact
These changes are expected to: